plot_ly() and ggplotly() functionsplot_geo()We will work with COVID data downloaded from the New York Times. The dataset consists of COVID-19 cases and deaths in each US state during the course of the COVID epidemic.
The objective of this lab is to explore relationships between cases, deaths, and population sizes of US states, and plot data to demonstrate this
cv_states_readin <- as.data.frame(data.table::fread("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv") )
state_pops <- as.data.frame(data.table::fread("https://raw.githubusercontent.com/COVID19Tracking/associated-data/master/us_census_data/us_census_2018_population_estimates_states.csv"))
state_pops$abb <- state_pops$state
state_pops$state <- state_pops$state_name
state_pops$state_name <- NULL
cv_states <- merge(cv_states_readin, state_pops, by="state")
head, and tail of the datadim(cv_states)
## [1] 40674 9
head(cv_states)
## state date fips cases deaths geo_id population pop_density abb
## 1 Alabama 2022-03-11 1 1288999 18832 1 4887871 96.50939 AL
## 2 Alabama 2022-01-21 1 1120881 16824 1 4887871 96.50939 AL
## 3 Alabama 2022-03-10 1 1288454 18766 1 4887871 96.50939 AL
## 4 Alabama 2021-07-17 1 559478 11443 1 4887871 96.50939 AL
## 5 Alabama 2021-06-22 1 549013 11311 1 4887871 96.50939 AL
## 6 Alabama 2022-04-17 1 1297869 19502 1 4887871 96.50939 AL
tail(cv_states)
## state date fips cases deaths geo_id population pop_density abb
## 40669 Wyoming 2021-04-25 56 57696 705 56 577737 5.950611 WY
## 40670 Wyoming 2021-09-30 56 90602 996 56 577737 5.950611 WY
## 40671 Wyoming 2021-08-27 56 73467 835 56 577737 5.950611 WY
## 40672 Wyoming 2020-10-19 56 9311 57 56 577737 5.950611 WY
## 40673 Wyoming 2021-04-26 56 57818 705 56 577737 5.950611 WY
## 40674 Wyoming 2020-07-28 56 2589 26 56 577737 5.950611 WY
str(cv_states)
## 'data.frame': 40674 obs. of 9 variables:
## $ state : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ date : IDate, format: "2022-03-11" "2022-01-21" ...
## $ fips : int 1 1 1 1 1 1 1 1 1 1 ...
## $ cases : int 1288999 1120881 1288454 559478 549013 1297869 42862 1632 1287822 792632 ...
## $ deaths : int 18832 16824 18766 11443 11311 19502 1007 44 18694 14155 ...
## $ geo_id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ population : int 4887871 4887871 4887871 4887871 4887871 4887871 4887871 4887871 4887871 4887871 ...
## $ pop_density: num 96.5 96.5 96.5 96.5 96.5 ...
## $ abb : chr "AL" "AL" "AL" "AL" ...
cv_states$date <- as.Date(cv_states$date, format="%Y-%m-%d")
state_list <- unique(cv_states$state)
cv_states$state <- factor(cv_states$state, levels = state_list)
abb_list <- unique(cv_states$abb)
cv_states$abb <- factor(cv_states$abb, levels = abb_list)
cv_states = cv_states[order(cv_states$state, cv_states$date),]
str(cv_states)
## 'data.frame': 40674 obs. of 9 variables:
## $ state : Factor w/ 52 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ date : Date, format: "2020-03-13" "2020-03-14" ...
## $ fips : int 1 1 1 1 1 1 1 1 1 1 ...
## $ cases : int 6 12 23 29 39 51 78 106 131 157 ...
## $ deaths : int 0 0 0 0 0 0 0 0 0 0 ...
## $ geo_id : int 1 1 1 1 1 1 1 1 1 1 ...
## $ population : int 4887871 4887871 4887871 4887871 4887871 4887871 4887871 4887871 4887871 4887871 ...
## $ pop_density: num 96.5 96.5 96.5 96.5 96.5 ...
## $ abb : Factor w/ 52 levels "AL","AK","AZ",..: 1 1 1 1 1 1 1 1 1 1 ...
head(cv_states)
## state date fips cases deaths geo_id population pop_density abb
## 376 Alabama 2020-03-13 1 6 0 1 4887871 96.50939 AL
## 184 Alabama 2020-03-14 1 12 0 1 4887871 96.50939 AL
## 16 Alabama 2020-03-15 1 23 0 1 4887871 96.50939 AL
## 161 Alabama 2020-03-16 1 29 0 1 4887871 96.50939 AL
## 82 Alabama 2020-03-17 1 39 0 1 4887871 96.50939 AL
## 275 Alabama 2020-03-18 1 51 0 1 4887871 96.50939 AL
tail(cv_states)
## state date fips cases deaths geo_id population pop_density abb
## 39956 Wyoming 2022-04-17 56 156258 1801 56 577737 5.950611 WY
## 40133 Wyoming 2022-04-18 56 156258 1801 56 577737 5.950611 WY
## 40541 Wyoming 2022-04-19 56 156392 1807 56 577737 5.950611 WY
## 40530 Wyoming 2022-04-20 56 156392 1807 56 577737 5.950611 WY
## 40386 Wyoming 2022-04-21 56 156392 1807 56 577737 5.950611 WY
## 40561 Wyoming 2022-04-22 56 156392 1807 56 577737 5.950611 WY
head(cv_states)
## state date fips cases deaths geo_id population pop_density abb
## 376 Alabama 2020-03-13 1 6 0 1 4887871 96.50939 AL
## 184 Alabama 2020-03-14 1 12 0 1 4887871 96.50939 AL
## 16 Alabama 2020-03-15 1 23 0 1 4887871 96.50939 AL
## 161 Alabama 2020-03-16 1 29 0 1 4887871 96.50939 AL
## 82 Alabama 2020-03-17 1 39 0 1 4887871 96.50939 AL
## 275 Alabama 2020-03-18 1 51 0 1 4887871 96.50939 AL
summary(cv_states)
## state date fips cases
## Washington : 823 Min. :2020-01-21 Min. : 1.00 Min. : 1
## Illinois : 820 1st Qu.:2020-09-13 1st Qu.:16.00 1st Qu.: 54036
## California : 819 Median :2021-03-27 Median :29.00 Median : 228202
## Arizona : 818 Mean :2021-03-27 Mean :29.78 Mean : 574982
## Massachusetts: 812 3rd Qu.:2021-10-09 3rd Qu.:44.00 3rd Qu.: 704763
## Wisconsin : 808 Max. :2022-04-22 Max. :72.00 Max. :9193232
## (Other) :35774
## deaths geo_id population pop_density
## Min. : 0 Min. : 1.00 Min. : 577737 Min. : 1.292
## 1st Qu.: 959 1st Qu.:16.00 1st Qu.: 1805832 1st Qu.: 43.659
## Median : 3804 Median :29.00 Median : 4468402 Median : 107.860
## Mean : 9199 Mean :29.78 Mean : 6417168 Mean : 422.754
## 3rd Qu.:11237 3rd Qu.:44.00 3rd Qu.: 7535591 3rd Qu.: 229.511
## Max. :90090 Max. :72.00 Max. :39557045 Max. :11490.120
## NA's :771
## abb
## WA : 823
## IL : 820
## CA : 819
## AZ : 818
## MA : 812
## WI : 808
## (Other):35774
# The earliest date:
min(cv_states$date)
## [1] "2020-01-21"
# The latest date:
max(cv_states$date)
## [1] "2022-04-22"
# The lowest case count:
min(cv_states$cases)
## [1] 1
# The highest case count:
max(cv_states$cases)
## [1] 9193232
# The lowest death count:
min(cv_states$deaths)
## [1] 0
# The highest death count:
max(cv_states$deaths)
## [1] 90090
new_cases and new_deaths and correct outliersnew_cases, and new deaths, new_deaths:
new_cases equal to the difference between cases on date i and date i-1, starting on date i=2for (i in 1:length(state_list)) {
cv_subset = subset(cv_states, state == state_list[i])
cv_subset = cv_subset[order(cv_subset$date),]
# add starting level for new cases and deaths
cv_subset$new_cases = cv_subset$cases[1]
cv_subset$new_deaths = cv_subset$deaths[1]
for (j in 2:nrow(cv_subset)) {
cv_subset$new_cases[j] = cv_subset$cases[j] - cv_subset$cases[j-1]
cv_subset$new_deaths[j] = cv_subset$deaths[j] - cv_subset$deaths[j-1]
}
cv_states$new_cases[cv_states$state==state_list[i]] = cv_subset$new_cases
cv_states$new_deaths[cv_states$state==state_list[i]] = cv_subset$new_deaths
}
# Focus on recent dates
cv_states <- cv_states %>% dplyr::filter(date >= "2021-07-01")
ggplotly for EDA: See if there are outliers or values that don’t make sense for new_cases and new_deaths. Which states and which dates have strange values?p1 <- ggplot(cv_states, aes(x = date, y = new_cases, colour = state)) +
geom_line()
ggplotly(p1)
p2 <- ggplot(cv_states, aes(x = date, y = new_deaths, colour = state)) +
geom_line()
ggplotly(p2)
Correct outliers: Set negative values for new_cases or new_deaths to 0
Inspect data again interactively
cv_states$new_cases[cv_states$new_cases < 0] = 0
p3a <- ggplot(cv_states, aes(x = date, y = new_cases, colour = state)) +
geom_line() +
geom_point(size = 0.5, alpha = 0.5)
ggplotly(p3a)
cv_states$new_deaths[cv_states$new_deaths < 0] = 0
p3b <- ggplot(cv_states, aes(x = date, y = new_deaths, colour = state)) +
geom_line() +
geom_point(size = 0.5, alpha = 0.5)
ggplotly(p3b)
Add population-normalized (by 100,000) variables for each variable type (rounded to 1 decimal place). Make sure the variables you calculate are in the correct format (numeric). You can use the following variable names:
per100k = cases per 100,000 populationnewper100k= new cases per 100,000deathsper100k = deaths per 100,000newdeathsper100k = new deaths per 100,000Add a “naive CFR” variable representing deaths / cases on each date for each state
Create a dataframe representing values on the most recent date, cv_states_today
cv_states$per100k = as.numeric(format(round(cv_states$cases/(cv_states$population/100000),1),nsmall=1))
cv_states$newper100k = as.numeric(format(round(cv_states$new_cases/(cv_states$population/100000),1),nsmall=1))
cv_states$deathsper100k = as.numeric(format(round(cv_states$deaths/(cv_states$population/100000),1),nsmall=1))
cv_states$newdeathsper100k = as.numeric(format(round(cv_states$new_deaths/(cv_states$population/100000),1),nsmall=1))
cv_states = cv_states %>% mutate(naive_CFR = round((deaths*100/cases),2))
max_date <- max(cv_states$date)
cv_states_today = cv_states %>% filter(date==as.Date(max_date))
plot_ly()plot_ly() representing pop_density vs. various variables (e.g. cases, per100k, deaths, deathsper100k) for each state on most recent date (cv_states_today)
cv_states_today %>%
plot_ly(x = ~pop_density, y = ~cases, type = 'scatter', mode = "markers",
color = ~state, size = ~population, sizes = c(5, 70),
marker = list(sizemode = 'diameter', opacity = 0.5))
cv_states_today %>%
filter(state != "District of Columbia") %>%
plot_ly(x = ~pop_density, y = ~cases, color = ~state, type = "scatter", mode = "markers",
size = ~population, sizes = c(5, 70),
marker = list(sizemode = 'diameter', opacity = 0.5))
hovermode = "compare"cv_states_today %>%
filter(state != "District of Columbia") %>%
na.omit(state) %>%
plot_ly(x = ~pop_density, y = ~cases, type = "scatter", mode = "markers",
color = ~state, size = ~population, sizes = c(5, 70),
marker = list(sizemode = 'diameter', opacity = 0.5),
hover_info = "text",
text = ~paste(paste0("State:", state),
paste0("Cases per 100k:", per100k),
paste0("Deaths per 100k:", deathsper100k),
sep = "<br>")
) %>%
layout(title = "Population-normalized cases per 100k",
yaxis = list(title = "cases per 100k"),
xaxis = list(title = "population density"),
hovermode = "compare")
ggplotly()pop_density vs. newdeathsper100k create a chart with the same variables using gglotly()pop_density correlates with newdeathsper100k?p4 <- cv_states_today %>%
filter(state != "District of Columbia") %>%
ggplot(aes(x = pop_density, y = deathsper100k, colour = state, size = population)) +
geom_point()
ggplotly(p4)
It appears as though the plot depicts that there is a somewhat positive correlation between population density and new deaths per 100,000. This doesn’t seem too unusual since it would make sense for states with greater population densities to experience greater spread of the virus and thus incur a higher death rate.
naive_CFR for all states over time using plot_ly()
naive_CFR for the states that had an increase in September. How have they changed over time?new_cases and new_deaths together in one plot. Hint: use add_layer()
cv_states %>%
plot_ly(x = ~date, y = ~naive_CFR, color = ~state, type = "scatter", mode = "lines")
cv_states %>%
filter(state == "Florida") %>%
plot_ly(x = ~date, y = ~new_cases, type = "scatter", mode = "lines", name = "cases") %>%
add_lines(x = ~date, y = ~new_deaths, type = "scatter", mode = "lines", name = "deaths")
By looking at and hovering over the charts above, we can see that the approximate delay between spikes in both new cases and new deaths is 1 week, or 7 days. This makes sense since the incubation period of the virus is around 7-14 days, so new cases and new deaths will follow a similar pattern of spiking around that same timeframe.
Create a heatmap to visualize new_cases for each state on each date greater than July 1st, 2021 - Start by mapping selected features in the dataframe into a matrix using the tidyr package function pivot_wider(), naming the rows and columns, as done in the lecture notes - Use plot_ly() to create a heatmap out of this matrix. Which states stand out?
cv_states_mat <- cv_states %>%
select(state, date, new_cases) %>%
filter(date > "2021-07-01")
cv_states_mat2 <- as.data.frame(pivot_wider(cv_states_mat,
names_from = state,
values_from = new_cases))
cv_states_mat2 <- cv_states_mat2 %>%
column_to_rownames("date") %>%
as.matrix()
plot_ly(x = colnames(cv_states_mat2),
y = rownames(cv_states_mat2),
z = ~cv_states_mat2,
type = "heatmap")
In the heatmap above, the state of California stands out as the only state that had a time period where a significant spike in new cases (likely fueled by the Omicron variant) occurred.
Lab 10b questions 1-2, lab 11 questions 0-10. Upload html or pdf for both lab Rmd’s to quercus.